{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "markdown"
    }
   },
   "source": [
    "## Introduction to Semantic Chunking\n",
    "Text chunking is an essential step in Retrieval-Augmented Generation (RAG), where large text bodies are divided into meaningful segments to improve retrieval accuracy.\n",
    "Unlike fixed-length chunking, semantic chunking splits text based on the content similarity between sentences.\n",
    "\n",
    "### Breakpoint Methods:\n",
    "- **Percentile**: Finds the Xth percentile of all similarity differences and splits chunks where the drop is greater than this value.\n",
    "- **Standard Deviation**: Splits where similarity drops more than X standard deviations below the mean.\n",
    "- **Interquartile Range (IQR)**: Uses the interquartile distance (Q3 - Q1) to determine split points.\n",
    "\n",
    "This notebook implements semantic chunking **using the percentile method** and evaluates its performance on a sample text."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up the Environment\n",
    "We begin by importing necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import fitz\n",
    "import os\n",
    "import numpy as np\n",
    "import json\n",
    "from openai import OpenAI"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting Text from a PDF File\n",
    "To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Understanding Artificial Intelligence \n",
      "Chapter 1: Introduction to Artificial Intelligence \n",
      "Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot \n",
      "to perform tasks commonly associated with intelligent beings. The term is frequently applied to \n",
      "the project of developing systems endowed with the intellectual processes characteristic of \n",
      "humans, such as the ability to reason, discover meaning, generalize, or learn from past \n",
      "experience. Over the past f\n"
     ]
    }
   ],
   "source": [
    "def extract_text_from_pdf(pdf_path):\n",
    "    \"\"\"\n",
    "    Extracts text from a PDF file.\n",
    "\n",
    "    Args:\n",
    "    pdf_path (str): Path to the PDF file.\n",
    "\n",
    "    Returns:\n",
    "    str: Extracted text from the PDF.\n",
    "    \"\"\"\n",
    "    # Open the PDF file\n",
    "    mypdf = fitz.open(pdf_path)\n",
    "    all_text = \"\"  # Initialize an empty string to store the extracted text\n",
    "    \n",
    "    # Iterate through each page in the PDF\n",
    "    for page in mypdf:\n",
    "        # Extract text from the current page and add spacing\n",
    "        all_text += page.get_text(\"text\") + \" \"\n",
    "\n",
    "    # Return the extracted text, stripped of leading/trailing whitespace\n",
    "    return all_text.strip()\n",
    "\n",
    "# Define the path to the PDF file\n",
    "pdf_path = \"data/AI_Information.pdf\"\n",
    "\n",
    "# Extract text from the PDF file\n",
    "extracted_text = extract_text_from_pdf(pdf_path)\n",
    "\n",
    "# Print the first 500 characters of the extracted text\n",
    "print(extracted_text[:500])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up the OpenAI API Client\n",
    "We initialize the OpenAI client to generate embeddings and responses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the OpenAI client with the base URL and API key\n",
    "client = OpenAI(\n",
    "    base_url=\"https://api.studio.nebius.com/v1/\",\n",
    "    api_key=os.getenv(\"OPENAI_API_KEY\")  # Retrieve the API key from environment variables\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Sentence-Level Embeddings\n",
    "We split text into sentences and generate embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generated 257 sentence embeddings.\n"
     ]
    }
   ],
   "source": [
    "def get_embedding(text, model=\"BAAI/bge-en-icl\"):\n",
    "    \"\"\"\n",
    "    Creates an embedding for the given text using OpenAI.\n",
    "\n",
    "    Args:\n",
    "    text (str): Input text.\n",
    "    model (str): Embedding model name.\n",
    "\n",
    "    Returns:\n",
    "    np.ndarray: The embedding vector.\n",
    "    \"\"\"\n",
    "    response = client.embeddings.create(model=model, input=text)\n",
    "    return np.array(response.data[0].embedding)\n",
    "\n",
    "# Splitting text into sentences (basic split)\n",
    "sentences = extracted_text.split(\". \")\n",
    "\n",
    "# Generate embeddings for each sentence\n",
    "embeddings = [get_embedding(sentence) for sentence in sentences]\n",
    "\n",
    "print(f\"Generated {len(embeddings)} sentence embeddings.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculating Similarity Differences\n",
    "We compute cosine similarity between consecutive sentences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cosine_similarity(vec1, vec2):\n",
    "    \"\"\"\n",
    "    Computes cosine similarity between two vectors.\n",
    "\n",
    "    Args:\n",
    "    vec1 (np.ndarray): First vector.\n",
    "    vec2 (np.ndarray): Second vector.\n",
    "\n",
    "    Returns:\n",
    "    float: Cosine similarity.\n",
    "    \"\"\"\n",
    "    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))\n",
    "\n",
    "# Compute similarity between consecutive sentences\n",
    "similarities = [cosine_similarity(embeddings[i], embeddings[i + 1]) for i in range(len(embeddings) - 1)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementing Semantic Chunking\n",
    "We implement three different methods for finding breakpoints."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_breakpoints(similarities, method=\"percentile\", threshold=90):\n",
    "    \"\"\"\n",
    "    Computes chunking breakpoints based on similarity drops.\n",
    "\n",
    "    Args:\n",
    "    similarities (List[float]): List of similarity scores between sentences.\n",
    "    method (str): 'percentile', 'standard_deviation', or 'interquartile'.\n",
    "    threshold (float): Threshold value (percentile for 'percentile', std devs for 'standard_deviation').\n",
    "\n",
    "    Returns:\n",
    "    List[int]: Indices where chunk splits should occur.\n",
    "    \"\"\"\n",
    "    # Determine the threshold value based on the selected method\n",
    "    if method == \"percentile\":\n",
    "        # Calculate the Xth percentile of the similarity scores\n",
    "        threshold_value = np.percentile(similarities, threshold)\n",
    "    elif method == \"standard_deviation\":\n",
    "        # Calculate the mean and standard deviation of the similarity scores\n",
    "        mean = np.mean(similarities)\n",
    "        std_dev = np.std(similarities)\n",
    "        # Set the threshold value to mean minus X standard deviations\n",
    "        threshold_value = mean - (threshold * std_dev)\n",
    "    elif method == \"interquartile\":\n",
    "        # Calculate the first and third quartiles (Q1 and Q3)\n",
    "        q1, q3 = np.percentile(similarities, [25, 75])\n",
    "        # Set the threshold value using the IQR rule for outliers\n",
    "        threshold_value = q1 - 1.5 * (q3 - q1)\n",
    "    else:\n",
    "        # Raise an error if an invalid method is provided\n",
    "        raise ValueError(\"Invalid method. Choose 'percentile', 'standard_deviation', or 'interquartile'.\")\n",
    "\n",
    "    # Identify indices where similarity drops below the threshold value\n",
    "    return [i for i, sim in enumerate(similarities) if sim < threshold_value]\n",
    "\n",
    "# Compute breakpoints using the percentile method with a threshold of 90\n",
    "breakpoints = compute_breakpoints(similarities, method=\"percentile\", threshold=90)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Splitting Text into Semantic Chunks\n",
    "We split the text based on computed breakpoints."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of semantic chunks: 231\n",
      "\n",
      "First text chunk:\n",
      "Understanding Artificial Intelligence \n",
      "Chapter 1: Introduction to Artificial Intelligence \n",
      "Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot \n",
      "to perform tasks commonly associated with intelligent beings.\n"
     ]
    }
   ],
   "source": [
    "def split_into_chunks(sentences, breakpoints):\n",
    "    \"\"\"\n",
    "    Splits sentences into semantic chunks.\n",
    "\n",
    "    Args:\n",
    "    sentences (List[str]): List of sentences.\n",
    "    breakpoints (List[int]): Indices where chunking should occur.\n",
    "\n",
    "    Returns:\n",
    "    List[str]: List of text chunks.\n",
    "    \"\"\"\n",
    "    chunks = []  # Initialize an empty list to store the chunks\n",
    "    start = 0  # Initialize the start index\n",
    "\n",
    "    # Iterate through each breakpoint to create chunks\n",
    "    for bp in breakpoints:\n",
    "        # Append the chunk of sentences from start to the current breakpoint\n",
    "        chunks.append(\". \".join(sentences[start:bp + 1]) + \".\")\n",
    "        start = bp + 1  # Update the start index to the next sentence after the breakpoint\n",
    "\n",
    "    # Append the remaining sentences as the last chunk\n",
    "    chunks.append(\". \".join(sentences[start:]))\n",
    "    return chunks  # Return the list of chunks\n",
    "\n",
    "# Create chunks using the split_into_chunks function\n",
    "text_chunks = split_into_chunks(sentences, breakpoints)\n",
    "\n",
    "# Print the number of chunks created\n",
    "print(f\"Number of semantic chunks: {len(text_chunks)}\")\n",
    "\n",
    "# Print the first chunk to verify the result\n",
    "print(\"\\nFirst text chunk:\")\n",
    "print(text_chunks[0])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Embeddings for Semantic Chunks\n",
    "We create embeddings for each chunk for later retrieval."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_embeddings(text_chunks):\n",
    "    \"\"\"\n",
    "    Creates embeddings for each text chunk.\n",
    "\n",
    "    Args:\n",
    "    text_chunks (List[str]): List of text chunks.\n",
    "\n",
    "    Returns:\n",
    "    List[np.ndarray]: List of embedding vectors.\n",
    "    \"\"\"\n",
    "    # Generate embeddings for each text chunk using the get_embedding function\n",
    "    return [get_embedding(chunk) for chunk in text_chunks]\n",
    "\n",
    "# Create chunk embeddings using the create_embeddings function\n",
    "chunk_embeddings = create_embeddings(text_chunks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Performing Semantic Search\n",
    "We implement cosine similarity to retrieve the most relevant chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def semantic_search(query, text_chunks, chunk_embeddings, k=5):\n",
    "    \"\"\"\n",
    "    Finds the most relevant text chunks for a query.\n",
    "\n",
    "    Args:\n",
    "    query (str): Search query.\n",
    "    text_chunks (List[str]): List of text chunks.\n",
    "    chunk_embeddings (List[np.ndarray]): List of chunk embeddings.\n",
    "    k (int): Number of top results to return.\n",
    "\n",
    "    Returns:\n",
    "    List[str]: Top-k relevant chunks.\n",
    "    \"\"\"\n",
    "    # Generate an embedding for the query\n",
    "    query_embedding = get_embedding(query)\n",
    "    \n",
    "    # Calculate cosine similarity between the query embedding and each chunk embedding\n",
    "    similarities = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]\n",
    "    \n",
    "    # Get the indices of the top-k most similar chunks\n",
    "    top_indices = np.argsort(similarities)[-k:][::-1]\n",
    "    \n",
    "    # Return the top-k most relevant text chunks\n",
    "    return [text_chunks[i] for i in top_indices]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query: What is 'Explainable AI' and why is it considered important?\n",
      "Context 1:\n",
      "\n",
      "Explainable AI (XAI) \n",
      "Explainable AI (XAI) aims to make AI systems more transparent and understandable. Research in \n",
      "XAI focuses on developing methods for explaining AI decisions, enhancing trust, and improving \n",
      "accountability.\n",
      "========================================\n",
      "Context 2:\n",
      "\n",
      "Transparency and Explainability \n",
      "Transparency and explainability are essential for building trust in AI systems. Explainable AI (XAI) \n",
      "techniques aim to make AI decisions more understandable, enabling users to assess their \n",
      "fairness and accuracy.\n",
      "========================================\n"
     ]
    }
   ],
   "source": [
    "# Load the validation data from a JSON file\n",
    "with open('data/val.json') as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "# Extract the first query from the validation data\n",
    "query = data[0]['question']\n",
    "\n",
    "# Get top 2 relevant chunks\n",
    "top_chunks = semantic_search(query, text_chunks, chunk_embeddings, k=2)\n",
    "\n",
    "# Print the query\n",
    "print(f\"Query: {query}\")\n",
    "\n",
    "# Print the top 2 most relevant text chunks\n",
    "for i, chunk in enumerate(top_chunks):\n",
    "    print(f\"Context {i+1}:\\n{chunk}\\n{'='*40}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating a Response Based on Retrieved Chunks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the system prompt for the AI assistant\n",
    "system_prompt = \"You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'\"\n",
    "\n",
    "def generate_response(system_prompt, user_message, model=\"meta-llama/Llama-3.2-3B-Instruct\"):\n",
    "    \"\"\"\n",
    "    Generates a response from the AI model based on the system prompt and user message.\n",
    "\n",
    "    Args:\n",
    "    system_prompt (str): The system prompt to guide the AI's behavior.\n",
    "    user_message (str): The user's message or query.\n",
    "    model (str): The model to be used for generating the response. Default is \"meta-llama/Llama-2-7B-chat-hf\".\n",
    "\n",
    "    Returns:\n",
    "    dict: The response from the AI model.\n",
    "    \"\"\"\n",
    "    response = client.chat.completions.create(\n",
    "        model=model,\n",
    "        temperature=0,\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": user_message}\n",
    "        ]\n",
    "    )\n",
    "    return response\n",
    "\n",
    "# Create the user prompt based on the top chunks\n",
    "user_prompt = \"\\n\".join([f\"Context {i + 1}:\\n{chunk}\\n=====================================\\n\" for i, chunk in enumerate(top_chunks)])\n",
    "user_prompt = f\"{user_prompt}\\nQuestion: {query}\"\n",
    "\n",
    "# Generate AI response\n",
    "ai_response = generate_response(system_prompt, user_prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluating the AI Response\n",
    "We compare the AI response with the expected answer and assign a score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Based on the evaluation criteria, I would assign a score of 0.5 to the AI assistant's response.\n",
      "\n",
      "The response is partially aligned with the true response, as it correctly identifies the main goal of Explainable AI (XAI) as making AI systems more transparent and understandable. However, it lacks some key details and nuances present in the true response. For example, the true response mentions the importance of assessing fairness and accuracy, which is not explicitly mentioned in the AI assistant's response. Additionally, the true response uses more precise language, such as \"providing insights into how they make decisions,\" which is not present in the AI assistant's response.\n"
     ]
    }
   ],
   "source": [
    "# Define the system prompt for the evaluation system\n",
    "evaluate_system_prompt = \"You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5.\"\n",
    "\n",
    "# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt\n",
    "evaluation_prompt = f\"User Query: {query}\\nAI Response:\\n{ai_response.choices[0].message.content}\\nTrue Response: {data[0]['ideal_answer']}\\n{evaluate_system_prompt}\"\n",
    "\n",
    "# Generate the evaluation response using the evaluation system prompt and evaluation prompt\n",
    "evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)\n",
    "\n",
    "# Print the evaluation response\n",
    "print(evaluation_response.choices[0].message.content)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv-new-specific-rag",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
